[PyTorch] Pad V when Q/V head dims differ (MLA) for THD by HollowMan6 · Pull Request #2629 · NVIDIA/TransformerEngine

HollowMan6 · 2026-01-27T23:31:21Z

Description

For MLA, we should pad V when the Q and V head dims differ in the THD format.

Similar to NVIDIA/Megatron-LM#3003

Fixes NVIDIA/Megatron-LM#1698

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

Please list the changes introduced in this PR:

pad V when Q/V head dims differ for THD

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

greptile-apps · 2026-01-27T23:34:47Z

Greptile Summary

This PR fixes a bug in THD-format MLA (e.g. DeepSeek V3) where FlashAttention 2 was entirely blocked for mismatched Q/K and V head dimensions. It introduces zero-padding of V (and optionally Q/K) up to max(head_dim_qk, head_dim_v) before the FA2 call and trims the output back to the original V head dimension afterward, while also tightening the FA2 head-dim validity check to account for the padded dimension.

Adds _pad_qkv_head_dim and _trim_output helpers; applies them in the FA2 branch when head_dim_qk != head_dim_v and the backend is FA2 (guards correctly skip FA3/FA4 which support MLA natively).
Removes the blanket FA2-disable guard for mismatched head dims in utils.py and replaces it with a fa2_padded_head_dim-based validity check that also enforces a >192 restriction on older architectures.
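
The pad-then-trim idea described above can be checked without any GPU code. What follows is a minimal sketch, assuming plain Python lists in place of tensors and a naive attention loop in place of FA2; `attention` and `pad_rows` are hypothetical helpers written for this illustration, not functions from the PR. It shows that zero columns appended to V come out as zero columns in the output, so slicing the output recovers the unpadded result.

```python
import math


def attention(q, k, v, scale):
    """Naive single-head attention over lists of row vectors (illustrative only)."""
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        probs = [w / total for w in weights]
        out.append([sum(p * vj[d] for p, vj in zip(probs, v)) for d in range(len(v[0]))])
    return out


def pad_rows(rows, target_dim):
    """Zero-pad the last dimension of each row up to target_dim (hypothetical helper)."""
    return [row + [0.0] * (target_dim - len(row)) for row in rows]


# Toy sizes: head_dim_qk = 4, head_dim_v = 2 (V is the narrower tensor, as in MLA).
q = [[1.0, 0.0, 2.0, -1.0], [0.5, 1.5, -0.5, 0.0]]
k = [[0.2, 0.1, 0.0, 1.0], [1.0, -1.0, 0.5, 0.3], [0.0, 0.4, 0.9, -0.2]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

head_dim_qk, head_dim_v = len(q[0]), len(v[0])
scale = 1.0 / math.sqrt(head_dim_qk)  # scale is fixed from the original Q/K head dim
padded_dim = max(head_dim_qk, head_dim_v)

reference = attention(q, k, v, scale)
padded_out = attention(
    pad_rows(q, padded_dim), pad_rows(k, padded_dim), pad_rows(v, padded_dim), scale
)
trimmed = [row[:head_dim_v] for row in padded_out]  # slice back to the original V width
```

Here `trimmed` matches `reference`, and the extra output columns in `padded_out` are all zero, which is why the trim step loses nothing.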

Confidence Score: 3/5

The non-FP8 THD MLA path works correctly, but the removed FA2 guard combined with the Float8TensorStorage exclusion from padding leaves the Float8 + MLA + FA2 combination unprotected — FA2 would receive tensors with mismatched head dimensions.

The removed blanket FA2 guard for head_dim_qk != head_dim_v in utils.py is not fully compensated by the padding logic in dot_product_attention.py, which skips Float8TensorStorage inputs. If a Float8 MLA configuration reaches FA2, it will call FA2 with unpadded mismatched head dims, causing a crash or incorrect results. The same guard previously covered this case safely.

Both changed files interact to create the regression: utils.py removes the guard that blocked FA2 for all mismatched-head-dim cases, while dot_product_attention.py adds padding but excludes Float8 tensors.

Important Files Changed

Filename	Overview
transformer_engine/pytorch/attention/dot_product_attention/dot_product_attention.py	Adds _pad_qkv_head_dim/_trim_output helpers and applies FA2 V-padding for MLA. The Float8TensorStorage exclusion from padding combined with the removed FA2 guard creates a regression where Float8 + MLA + FA2 reaches FA2 with unpadded mismatched head dims.
transformer_engine/pytorch/attention/dot_product_attention/utils.py	Removes FA2 MLA blanket guard and replaces it with a padded-head-dim validity check; adds >192 restriction for older architectures. Logic is correct for the non-Float8 path; the Float8 regression stems from the interaction with dot_product_attention.py.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["DotProductAttention.forward - MLA head_dim_qk != head_dim_v"] --> B{use_flash_attention?}
    B -- No --> C[FusedAttention or Unfused]
    B -- Yes --> D{backend == FA2 version?}
    D -- No --> E[FA3/FA4 support MLA natively - no padding needed]
    D -- Yes --> F{value is Float8TensorStorage?}
    F -- Yes --> G["Skip padding - FA2 receives mismatched head dims - potential crash"]
    F -- No --> H[_pad_qkv_head_dim - pad V to head_dim_qk]
    H --> I[flash_attention with padded Q/K/V]
    I --> J{orig_qk_dim > orig_v_dim?}
    J -- Yes --> K[_trim_output - slice back to orig_head_dim_v]
    J -- No --> L[Return attn_out as-is]
    K --> M[Correct output]
    L --> M

Comments Outside Diff (1)

transformer_engine/pytorch/attention/dot_product_attention/dot_product_attention.py, line 1657-1668 (link)

Float8 MLA with FA2 receives unpadded mismatched tensors

The old guard in utils.py unconditionally disabled FA2 for every head_dim_qk != head_dim_v case, including Float8. Now that guard is gone, FA2 can be selected for Float8 MLA. But the padding block here excludes Float8TensorStorage, so FA2 is called with the original mismatched dims — a crash or silent corruption that the old guard prevented.

Either restore the disabled-FA2 guard in utils.py specifically for the Float8 + mismatched-head-dim case, or drop the not isinstance(value_layer, Float8TensorStorage) exclusion here so Float8 tensors get padded along the same path (if F.pad supports Float8TensorStorage).
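
The first option in the comment above (restoring the guard only for the Float8 case) can be sketched as a small predicate. This is a hedged illustration, not the code in utils.py: `fa2_allowed_sketch` and its parameters are hypothetical names, and every other FA2 eligibility check is deliberately left out.

```python
def fa2_allowed_sketch(head_dim_qk: int, head_dim_v: int, is_float8: bool) -> bool:
    """Hypothetical guard: allow FA2 for mismatched head dims only if padding will run."""
    if head_dim_qk == head_dim_v:
        # Nothing to pad; other FA2 checks (not modeled here) would still apply.
        return True
    # Mismatched dims are only safe when the padding path is taken,
    # and the review notes that Float8 inputs skip that path.
    return not is_float8


# Illustrative head dims only (192 for Q/K, 128 for V).
bf16_mla = fa2_allowed_sketch(192, 128, is_float8=False)
fp8_mla = fa2_allowed_sketch(192, 128, is_float8=True)
```

With this shape, the non-Float8 MLA case stays eligible for FA2 while the Float8 MLA case is rejected before it can reach the kernel with unpadded tensors.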

Reviews (9): Last reviewed commit: "Support when v is larger than qk"

greptile-apps

2 files reviewed, 2 comments

Copilot

Pull request overview

This PR adds support for Multi-head Latent Attention (MLA) with mismatched Q/V head dimensions in the THD format (total tokens, heads, head dimension). When the value tensor has a smaller head dimension than the query/key tensors, the code pads the value tensor to match the Q/K head dimension, runs the attention operation, and then trims the output back to the original V dimension.

Changes:

Added padding logic for V tensor when head dimensions differ in THD format
Implemented trimming function to restore correct output dimensions after attention
Added test case for THD attention with mismatched Q/V head dimensions

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File	Description
transformer_engine/pytorch/attention/dot_product_attention/dot_product_attention.py	Implements padding of V tensor before attention and trimming of output after attention for THD format with mismatched Q/V head dimensions
tests/pytorch/attention/test_attention.py	Adds test case to verify THD attention works with different Q/V head dimensions

greptile-apps

1 file reviewed, 1 comment

greptile-apps

2 files reviewed, no comments

cyanguwa · 2026-04-22T22:28:34Z

This change should only be required by the FlashAttention backend. The other two backends FusedAttention and UnfusedDPA do support MLA (head_dim_qk != head_dim_v). I'd propose a few changes:

move the F.pad() code to the if use_flash_attention: branch, and only call _trim_thd_output in that branch
move _trim_thd_output to the top of the dpa file
enable the MLA support for FlashAttention (TransformerEngine/transformer_engine/pytorch/attention/dot_product_attention/utils.py, line 706 in 3c62f42):

if head_dim_qk != head_dim_v:

remove the newly added test, since it will already be covered by the test_dpa_qkv_layout_thd test once the change in utils.py is done (TransformerEngine/tests/pytorch/attention/test_attention.py, line 1021 in 3c62f42):

def test_dpa_qkv_layout_thd(dtype, model_configs, model, qkv_layout):

@vcherepanov-nv, could you help push this PR through the finish line? Thanks!

HollowMan6 · 2026-04-28T14:14:20Z

Thank you @cyanguwa, I just cleaned up the PR and followed your suggestions. Please let me know what you think, @vcherepanov-nv.

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

cyanguwa · 2026-06-01T15:17:39Z

/te-ci pytorch L0

cyanguwa · 2026-06-03T00:58:38Z

/te-ci pytorch L0

cyanguwa

@HollowMan6, could you help fix the failed tests please? Sorry, it's an oversight on my side too. Right now, _pad_value_layer and _trim_output both assume that V has a shorter head_dim than Q/K, but it could happen the other way as well.

// failed tests: "mla_1_0", "mla_1_1"

TransformerEngine/tests/pytorch/attention/test_attention.py, lines 568 to 569 in 5535b09:

"mla_1_0": ModelConfig(8, 128, 16, 64, head_dim_v=128),
"mla_1_1": ModelConfig(4, 128, 16, 64, max_seqlen_kv=256, head_dim_v=128),

// failed error:
https://github.com/Dao-AILab/flash-attention/blob/d80a77103021c4e980f8cbbf85774f6a19e6474a/csrc/flash_attn/flash_api.cpp#L418

I wonder if we can make the pad function look something like this:

def _pad_qkv_head_dim(query_layer, key_layer, value_layer):
    ...
    return new_q, new_k, new_v, orig_head_dim_qk, orig_head_dim_v

Also, only call _trim_output when padded_head_dim_v > orig_head_dim_v; otherwise, it should be a no-op.
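
The proposed shape can be sketched in a framework-free way. This is a minimal illustration under stated assumptions, not the merged implementation: it uses nested Python lists instead of tensors, the names `pad_last_dim`, `pad_qkv_head_dim_sketch`, and `trim_output_sketch` are hypothetical, and the Q/K head dim of 64 is an assumed value chosen so that V (128) is the wider tensor, as in the failing configs.

```python
def pad_last_dim(rows, target_dim):
    """Append zeros so each row has target_dim entries; no-op if already that wide."""
    return [row + [0.0] * (target_dim - len(row)) for row in rows]


def pad_qkv_head_dim_sketch(q, k, v):
    """Pad Q, K, and V to a common head dim and report the original dims."""
    orig_head_dim_qk, orig_head_dim_v = len(q[0]), len(v[0])
    padded = max(orig_head_dim_qk, orig_head_dim_v)
    return (
        pad_last_dim(q, padded),
        pad_last_dim(k, padded),
        pad_last_dim(v, padded),
        orig_head_dim_qk,
        orig_head_dim_v,
    )


def trim_output_sketch(out, orig_head_dim_v):
    """Slice the output back to the original V width; return it unchanged otherwise."""
    if len(out[0]) > orig_head_dim_v:
        return [row[:orig_head_dim_v] for row in out]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# V wider than Q/K: assumed head_dim_qk = 64, head_dim_v = 128.
q = [[0.01 * (i + j) for j in range(64)] for i in range(3)]
k = [[0.02 * (i - j) for j in range(64)] for i in range(3)]
v = [[float(i + j) for j in range(128)] for i in range(3)]

new_q, new_k, new_v, orig_qk, orig_v = pad_qkv_head_dim_sketch(q, k, v)

# Zero-padding Q and K leaves every Q.K dot product unchanged.
scores_before = [[dot(qi, kj) for kj in k] for qi in q]
scores_after = [[dot(qi, kj) for kj in new_k] for qi in new_q]
```

In this direction the output already has the original V width, so `trim_output_sketch` returns its input untouched. One caveat the sketch does not cover: if a kernel derives its default softmax scale from the head dimension it receives, padding Q/K changes that default, so the scale based on the original head_dim_qk would need to be passed explicitly.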

HollowMan6 · 2026-06-04T17:11:27Z

Thank you for pointing this out, @cyanguwa. Originally I didn't handle the v > qk case, as it does not occur in practice for MLA, but since the test cases cover it, I have just pushed the changes accordingly.

cyanguwa · 2026-06-04T22:04:13Z

/te-ci pytorch L0

Copilot AI review requested due to automatic review settings January 27, 2026 23:31

Copilot started reviewing on behalf of HollowMan6 January 27, 2026 23:31

greptile-apps Bot reviewed Jan 27, 2026

Comment thread transformer_engine/pytorch/attention/dot_product_attention/dot_product_attention.py Outdated

Comment thread tests/pytorch/attention/test_attention.py Outdated

Copilot AI reviewed Jan 27, 2026

Comment thread transformer_engine/pytorch/attention/dot_product_attention/dot_product_attention.py Outdated

Comment thread tests/pytorch/attention/test_attention.py Outdated

HollowMan6 force-pushed the mla_thd branch from d8b40c5 to f9d1f5c Compare January 27, 2026 23:49

greptile-apps Bot reviewed Jan 27, 2026

Comment thread transformer_engine/pytorch/attention/dot_product_attention/dot_product_attention.py Outdated

HollowMan6 force-pushed the mla_thd branch from f9d1f5c to c04188f Compare February 18, 2026 09:51

greptile-apps Bot reviewed Feb 18, 2026

HollowMan6 mentioned this pull request Feb 23, 2026: [megatron] fix: patch support newer mcore version verl-project/verl#5372 (merged)

HollowMan6 force-pushed the mla_thd branch from c04188f to 5b8ff61 Compare March 3, 2026 19:17

ptrendx added the community-contribution label Apr 9, 2026

ptrendx assigned cyanguwa Apr 9, 2026

cyanguwa requested a review from vcherepanov-nv April 22, 2026 22:21

cyanguwa added the 2.16.0 label Apr 22, 2026

HollowMan6 force-pushed the mla_thd branch from 5b8ff61 to c3273cb Compare April 28, 2026 14:13

HollowMan6 requested a review from Copilot April 29, 2026 21:27

Copilot started reviewing on behalf of HollowMan6 April 29, 2026 21:28

Copilot AI reviewed Apr 29, 2026

Comment thread transformer_engine/pytorch/attention/dot_product_attention/dot_product_attention.py

Comment thread transformer_engine/pytorch/attention/dot_product_attention/utils.py

HollowMan6 force-pushed the mla_thd branch from c3273cb to 490d29f Compare May 10, 2026 23:32

HollowMan6 force-pushed the mla_thd branch from 490d29f to fdebd2f Compare May 24, 2026 04:20

HollowMan6 requested a review from cyanguwa as a code owner May 24, 2026 04:20

github-actions Bot added the org-contribution label May 24, 2026

KshitijLakhani removed the 2.16.0 label May 27, 2026

cyanguwa reviewed Jun 1, 2026

Comment thread transformer_engine/pytorch/attention/dot_product_attention/dot_product_attention.py Outdated

HollowMan6 requested a review from cyanguwa June 1, 2026 17:23

cyanguwa approved these changes Jun 3, 2026

cyanguwa requested changes Jun 4, 2026

HollowMan6 added 3 commits June 4, 2026 10:06

[PyTorch] Pad V when Q/V head dims differ (MLA) for THD

2bf3a58

Signed-off-by: Hollow Man <hollowman@opensuse.org>

Address review suggestions

36cdcab

Signed-off-by: Hollow Man <hollowman@opensuse.org>

Support when v is larger than qk

1ff200f

Signed-off-by: Hollow Man <hollowman@opensuse.org>

HollowMan6 force-pushed the mla_thd branch from b08f0e1 to 1ff200f Compare June 4, 2026 17:08

cyanguwa approved these changes Jun 5, 2026

cyanguwa merged commit 8a5af97 into NVIDIA:main Jun 5, 2026
20 of 25 checks passed

HollowMan6 deleted the mla_thd branch June 5, 2026 18:02
